SAS Code
libname datum '~/../myshortcuts/SAS/CPS_Project/perf';
data demog;
set datum.performancetask_demog;
run;
data finstat;
set datum.performancetask_finstat;
run;
data studentDataRaw;
merge demog finstat;
by student_unique_id;
if grade_level = '09' or grade_level = '10' or grade_level = '11' or grade_level = '12';
if (leave_date < input('2015-06-01',yymmdd10.) or entry_date > input('2016-06-30',yymmdd10.)) then delete;
/* create empty values */
numerator=.;
denominator=.;
/* set inital denominator values */
if leave_date >= input('2015-06-01',yymmdd10.) then denominator = 1;
/* if still enrolled then include in denominator and exclude from numerator*/
if leave_date >= input('2016-06-30',yymmdd10.) then denominator = 1;
if leave_date >= input('2016-06-30',yymmdd10.) then numerator = 0;
/* define denominator values for leave_codes */
if leave_code = '31' then denominator = 1;
else if leave_code = '32' or leave_code = '33' or leave_code = '34' then do;
if (
verified_transfer = 'yes'
or verified_transfer = 'unknown'
or leave_date >= input('2016-02-01',yymmdd10.)
) and denominator = . then denominator = 0;
else denominator = 1;
end;
else if leave_code = '35' and denominator = . then denominator = 0;
else if leave_code = '40' and denominator = . then denominator = 0;
else if leave_code = '41' and denominator = . then denominator = 0;
else if leave_code = '52' and denominator = . then denominator = 1;
else if leave_code = '53' and denominator = . then denominator = 1;
else if leave_code = '55' and denominator = . then denominator = 1;
else if leave_code = '67' and denominator = . then denominator = 1;
else if leave_code = '86' and denominator = . then denominator = 1;
else if leave_code = '87' and denominator = . then denominator = 1;
else if leave_code = '88' and denominator = . then denominator = 1;
else if leave_code = '99' and denominator = . then denominator = 1;
/* define numerator values for leave_codes */
if leave_code = '31' then numerator = 1;
else if leave_code = '32' or leave_code = '33' or leave_code = '34' then do;
if (
verified_transfer = 'yes'
or verified_transfer = 'unknown'
or leave_date >= input('2016-02-01',yymmdd10.)) and denominator ^= 1 then numerator = 0;
else numerator = 1;
end;
else if leave_code = '35' and numerator =. then numerator = 0;
else if leave_code = '40' and numerator = . then numerator = 0;
else if leave_code = '41' and numerator = . then numerator = 0;
else if leave_code = '52' and numerator = . then numerator = 1;
else if leave_code = '53' and numerator = . then numerator = 1;
else if leave_code = '55' and numerator = . then numerator = 0;
else if leave_code = '67' and numerator = . then numerator = 1;
else if leave_code = '86' and numerator = . then numerator = 1;
else if leave_code = '87' and numerator = . then numerator = 1;
else if leave_code = '88' and numerator = . then numerator = 1;
else if leave_code = '99' and numerator = . then numerator = 1;
run;
data studentDataErrors;
set studentDataRaw;
if denominator = . or numerator = .;
run;
data studentData;
retain school_unique_id student_unique_id grade_level student_gender leave_code verified_transfer numerator denominator;
set studentDataRaw (keep=school_unique_id student_unique_id grade_level student_gender leave_code verified_transfer numerator denominator);
if numerator ^=. and denominator ^= .;
run;
/* create district level dataset */
proc sql;
create table districtData as
select
(
sum(numerator)/sum(denominator)
) as overall_dropout_rate,
(
sum((student_gender eq 'FEMALE')*numerator)/sum((student_gender eq 'FEMALE')*denominator)
) as female_dropout_rate,
(
sum((student_gender eq 'MALE')*numerator)/sum((student_gender eq 'MALE')*denominator)
) as male_dropout_rate
from studentData;
quit;
/* create school level dataset */
proc sql;
create table schoolData as
select
school_unique_id,
(
sum(numerator)/sum(denominator)
) as schoolwide_dropout_rate,
(
sum((student_gender eq 'FEMALE')*numerator)/sum((student_gender eq 'FEMALE')*denominator)
) as female_dropout_rate,
(
sum((student_gender eq 'MALE')*numerator)/sum((student_gender eq 'MALE')*denominator)
) as male_dropout_rate
from studentData
group by school_unique_id;
quit;
proc export data=studentDataRaw
outfile='~/../myshortcuts/SAS/CPS_Project/studentDataRaw.csv'
dbms=csv
replace;
run;
proc export data=studentDataErrors
outfile='~/../myshortcuts/SAS/CPS_Project/studentDataErrors.csv'
dbms=csv
replace;
run;
proc export data=districtData
outfile='~/../myshortcuts/SAS/CPS_Project/districtData.csv'
dbms=csv
replace;
run;
proc export data=schoolData
outfile='~/../myshortcuts/SAS/CPS_Project/schoolData.csv'
dbms=csv
replace;
run;
proc delete data=work.demog work.finstat work.studentDataRaw;
Data files produced by the SAS code:
studentData.csv- File containing only data columns requested in the packet.studentDataRaw.csv- File containing all student data and the computed numerator and denominator values.studentErrors.csv- File containing all students who were not assigned a numerator or denominator by our SAS code.districtData.csv- File containing our drop-out rate computations for the entire district.schoolData.csv- File containing our drop-out rate computations for each school.
Data Quality
Tools used for Data QA
To check the quality of the data, I will be using R. To check the data for errors, we will be using the studentDataRaw data set generated by our SAS code. We will refer to it simply as studentData.
studentData <- read.csv("studentDataRaw.csv")
schoolData <- read.csv('schoolData.csv')
districtData <- read.csv('districtData.csv')
errors <- read.csv('studentDataErrors.csv')
For data quality check, we
- Chose a handful of students with different
leave_codeand verify numerator/denominator indicators by hand. - Analyze the
studentDataErrorsdataset generated by our SAS script which contains 29 errors. - Double check our
schoolDatacalculations to see if they make sense and if there are any errors.
In this presentation we carry out bullets 2 and 3. We also discuss ways we may consider in improving our metric.
studentDataErrors
There are 29 students in the error file generated by our SAS code. Below is a summary of the data.
## STUDENT_UNIQUE_ID SCHOOL_UNIQUE_ID grade_level STUDENT_GENDER ## 244021956: 1 1440252:10 Min. : 9.00 FEMALE:14 ## 248634831: 1 1440079: 5 1st Qu.:10.00 MALE :15 ## (Other) :27 (Other):14 Median :12.00 ## Mean :11.21 ## 3rd Qu.:12.00 ## Max. :12.00 ## leave_date leave_code verified_transfer numerator ## Min. :2015-08-14 L:29 N/A:29 Mode:logical ## 1st Qu.:2015-09-22 NA's:29 ## Median :2016-03-10 ## Mean :2016-01-31 ## 3rd Qu.:2016-05-13 ## Max. :2016-06-28 ## denominator ## Min. :1 ## 1st Qu.:1 ## Median :1 ## Mean :1 ## 3rd Qu.:1 ## Max. :1
Observations
By analyzing the summary table, we observe the following:
- Each of these students has
leave_code = Landverified_transfer = 'N/A', as well asdenominator = 1. Our SAS code is not programmed to deal with this leave code. - The range of
leave_dateis from Aug 14 2015 to Jun 28 2016. This means that these students left during the school year.
Investigation
We observed that leave_code = L and verified_transfer = 'N/A'. Based on the contingency table below, we see that there are other rows of data which meet this criteria though were assigned a numerator:
| 31 | 32 | 33 | 34 | 35 | 40 | 41 | 52 | 55 | 67 | 86 | 87 | 99 | L | N/ | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| N/A | 71212 | 1204 | 0 | 0 | 0 | 783 | 265 | 77 | 5 | 3125 | 266 | 3502 | 1956 | 353 | 23991 | 9764 |
| NO | 0 | 0 | 908 | 723 | 25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| UNKNOWN | 0 | 0 | 22 | 223 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| YES | 0 | 0 | 720 | 4456 | 13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
The range of leave_date for the students who did not receive a numerator is August 14 2015 to June 28 2016, which means these students left during the school year. Is this significant? Below is a frequency chart of leave_date for all students with leave_code=L and verified_transfer = N/A.
The rest of the students with the same leave code and verified_transfer have a leave date between July 01 2016 and July 01 2018
Possible Solutions
Possible solutions are:
- Contact schools for further information about the student.
- Edit metric definition/code system to reflect the situation these students fall under.
schoolData
Below shows a histogram of the School-wide drop-out rate computed by our SAS code.
65 schools have a drop-out rate equal to 1. Why?
Further investigation is needed to determine why these schools were rated so high. We leave it as an exercise for the reader.
| school_unique_id | schoolwide_dropout_rate | female_dropout_rate | male_dropout_rate | number.of.students |
|---|---|---|---|---|
| 1440097 | 1 | NA | 1 | 1 |
| 1440176 | 1 | 1 | 1 | 4 |
| 1440216 | 1 | 1 | 1 | 5 |
| 1440266 | 1 | NA | 1 | 1 |
| 1440511 | 1 | 1 | 1 | 13 |
| 2195294 | 1 | 1 | 1 | 4 |
QUESTION FROM SCHOOL
Question
"John Smith (Student Unique ID 256595825) left our school to transfer to another Chicago Public School. We can see on the student system that, John enrolled at Potter High School (School Unique ID 2198052) on 10/14/2016 and remains enrolled. Why was he counted as a dropout in the 2015-2016 dropout rate?"
Answer
Below is John Smith's data from our SAS script.
| STUDENT_UNIQUE_ID | SCHOOL_UNIQUE_ID | leave_code | leave_date | verified_transfer | numerator | denominator | |
|---|---|---|---|---|---|---|---|
| 4449 | 256595825 | 2195093 | 31 | 2016-05-24 | N/A | 1 | 1 |
John Smith has a leave code of 31, which tells us he transferred to another school. John is listed as having not enrolled (transfer status is N/A) in a new school by June 30th, 2016, which is why he is included in the metric.
Changes to the metric
Something to consider when defining metrics is how the metric varies over the population sample. Interact with the graph below by clicking/double clicking on the legend.